Abstract



One of two independent stochastic processes (arms) is to be selected at each of n stages.
The selection is sequential and depends on past observations as well as the prior information.
Observations from arm i are independent given a distribution $P_i$, and, following Clayton and
Berry (1985), the $P_i$'s have independent Dirichlet process priors. The objective is to maximize
the expected future-discounted sum of the n observations. We study structural properties 
of the bandit, in particular how the maximum expected payoff and the optimal strategy 
vary with the Dirichlet process priors. The main results are (i) for a particular arm and 
a fixed prior weight, the maximum expected payoff increases as the mean of the Dirichlet 
process prior becomes larger in the increasing convex order; (ii) for a fixed prior mean, 
the maximum expected payoff decreases as the prior weight increases. Specializing to the 
one-armed bandit, the second result captures the intuition that, given the same immediate 
payoff, the more is known about an arm, the less desirable it becomes because there is less 
to learn when selecting that arm. This extends some results of Gittins and Wang (1992) on 
Bernoulli bandits and settles a conjecture of Clayton and Berry (1985).

1 Introduction

Bandit problems are classical problems in statistical decision theory and have received consid- 
erable attention; see Berry and Fristedt (1985) for an overview. We consider discrete-time, 
finite-horizon, two-armed bandits from a Bayesian perspective. At each of n stages, an observa- 
tion is taken from one of two stochastic processes (arms). A strategy specifies which process to 
select based on past observations. The objective is to maximize the expected payoff, $\sum_{i=1}^n a_i Z_i$,
where $Z_i$ is the observation at stage $i$ and $A_n = (a_1, a_2, \ldots, a_n)$ is a discount sequence satisfying
$a_i \ge 0$ and $\sum_{i=1}^n a_i > 0$. A strategy is optimal if it achieves the maximum expected payoff.
An arm is optimal initially if there exists an optimal strategy that selects that arm at the first 
stage. 

The most widely studied bandit problem is the Bernoulli bandit, where each arm generates 
a sequence of exchangeable Bernoulli random variables. Bernoulli bandits are important as 
a model for clinical trials. Others such as normal bandits have also been extensively studied
(Chernoff 1968). Extending the Bernoulli bandit, Clayton and Berry (1985) have introduced
a one-armed Bayesian nonparametric bandit using Dirichlet process priors (Ferguson 1973).
Chattopadhyay (1994) extends this and studies the two-armed Dirichlet bandit, which is also
the setting of this work. Associated with arms 1 and 2 are probability measures $P_i$, $i = 1, 2$,
respectively. Observations from arm $i$ are independent samples given $P_i$; observations from
different arms are independent. The $P_i$'s themselves are treated as random, with independent
Dirichlet process priors. Specifically, $P_i \sim DP(\alpha_i)$, where $\alpha_i$ is a finite nonnull measure with a
finite first moment. It is often helpful to write $\alpha_i = M_i F_i$ where $M_i = \alpha_i(\mathbb{R})$ so that $F_i$ is a
probability distribution. We refer to $F_i$ and $M_i$ as the prior mean distribution and prior weight
of the Dirichlet process, respectively. We use $(\alpha_1, \alpha_2; A_n)$ to denote such a Dirichlet bandit with
discount sequence $A_n$.

For such problems one must balance the desire to maximize the immediate payoff and the 
need to explore a less known arm in the hope of higher payoff later on (the exploitation versus 
exploration dilemma). Optimal strategies are usually specified through backward induction and 
are nontrivial to compute. Nevertheless certain structural properties such as the stay-on-a- 
winner rule (Bradt, Johnson and Karlin 1956; Berry 1972) often hold under suitable conditions. 
For Dirichlet bandits with known arm 2, Clayton and Berry (1985) obtain several structural 
results. In particular, the maximum expected payoff increases as $F_1$, the mean of the Dirichlet
process prior for arm 1, increases in the usual stochastic order. Also, a version of the stay-on- 
a-winner rule holds: if arm 1 is optimal initially then it is optimal at the next stage provided 
that the initial observation from arm 1 is sufficiently large. Such results have been extended to 
the general two-armed Dirichlet bandits (Chattopadhyay 1994). 

This paper studies further structural properties of Dirichlet bandits, in particular how the 
value of the bandit (i.e., the maximum expected payoff) varies with the Dirichlet process priors. 
The main results are (i) the value increases as the mean of the Dirichlet process for any arm 
becomes larger in the increasing convex order (defined below); (ii) the value decreases as the prior
weight of the Dirichlet process of an arm increases. The second result agrees with the intuition 
that, given the same immediate payoff, an arm is less appealing when more is known about 
it, because there remains less to be explored. Though easy to state and intuitively appealing, 
such results are often difficult to prove. We mention a long-standing conjecture of Berry (1972), 
which states that for a finite-horizon Bernoulli two-armed bandit with uniform discounting and 
independent Beta($u_i, v_i$) priors, $i = 1, 2$, for arms 1 and 2 respectively, if $u_1/v_1 = u_2/v_2$ and
$u_1 + v_1 < u_2 + v_2$, then arm 1 is preferred to arm 2 at the initial pull. If, instead of finite-horizon
uniform discounting, we assume infinite-horizon geometric discounting, then the corresponding 
conjecture is true, as shown by Gittins and Wang (1992), who also prove analogous results for 
some other parametric bandits. Geometric discounting is special in that the optimal strategy for 
a multi-armed bandit is characterized by a "dynamic allocation index," or Gittins index (Gittins 
and Jones 1974; Gittins 1979; Whittle 1980), which reduces the problem to several one-armed 
bandits. 

As the Bernoulli bandit is a special case of the Dirichlet bandit, our results may be regarded 
as a generalization of Gittins and Wang (1992), although our method of proof, based on convexity 
and stochastic orders, is different. Our main result (Corollary 2) confirms a conjecture of Clayton
and Berry (1985) concerning the break-even value in the one-armed Dirichlet bandit. We also
prove another conjecture of Clayton and Berry (1985) concerning the break-even observation
when both arms are optimal initially (Proposition 1). These results will hopefully shed some
light on the conjecture of Berry (1972). See Herschkorn (1997) for related results and conjectures 
on the Bernoulli bandit. 

We find the usual stochastic order, the convex order and the increasing convex order
particularly helpful in formulating and deriving the main results. For random variables $Z_1$ and $Z_2$
taking values on $\mathbb{R}$, we write $Z_1 \le_{st} Z_2$ (respectively, $Z_1 \le_{cx} Z_2$), if

$$E\phi(Z_1) \le E\phi(Z_2) \tag{1}$$

for every increasing (respectively, convex) function $\phi$ such that the expectations exist. If $Z_1 \le_{st} Z_2$
then we also say $Z_2$ is to the right of $Z_1$. We say $Z_1$ is smaller than $Z_2$ in the increasing convex
order, written as $Z_1 \le_{icx} Z_2$, if (1) holds for every increasing and convex function $\phi$ such that the
expectations exist. Hence $\le_{icx}$ is implied by either $\le_{st}$ or $\le_{cx}$. The convex order is concerned
with variability. For example, if $Z_1 \le_{cx} Z_2$, both with finite second moments, then $EZ_1 = EZ_2$
and $\mathrm{Var}(Z_1) \le \mathrm{Var}(Z_2)$. Another basic property is closure under mixtures: if distributions
$F_i, G_i$, $i = 1, 2$, satisfy $F_1 \le_{cx} F_2$ and $G_1 \le_{cx} G_2$ then $pF_1 + (1-p)G_1 \le_{cx} pF_2 + (1-p)G_2$, $p \in [0, 1]$;
closure under mixtures also holds for $\le_{icx}$ and $\le_{st}$. (We use the notation $\le_{st}$, $\le_{cx}$, $\le_{icx}$
with distribution functions as well as random variables.) For further properties and applications
of various stochastic orders, see Müller and Stoyan (2002) and Shaked and Shanthikumar (2007).
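As a rough illustration of the increasing convex order, the sketch below checks it for finitely supported distributions. It relies on a standard characterization that is not stated in this paper and is assumed here: for distributions with finite means, $Z_1 \le_{icx} Z_2$ holds exactly when $E(Z_1 - t)^+ \le E(Z_2 - t)^+$ for all real $t$. For finite supports the difference of the two sides is piecewise linear in $t$, so it is enough to test $t$ at the support points. The function names `stop_loss` and `leq_icx` are hypothetical.

```python
def stop_loss(dist, t):
    """E[(Z - t)^+] for a finite distribution given as {value: probability}."""
    return sum(p * max(z - t, 0.0) for z, p in dist.items())


def leq_icx(dist1, dist2, tol=1e-12):
    """Check Z1 <=_icx Z2 via the stop-loss characterization (an assumption
    of this sketch, not a result taken from the paper).

    The difference of the two stop-loss transforms is piecewise linear with
    kinks only at support points, so testing t at those points suffices.
    """
    knots = sorted(set(dist1) | set(dist2))
    return all(stop_loss(dist1, t) <= stop_loss(dist2, t) + tol for t in knots)


if __name__ == "__main__":
    point_mass = {0.5: 1.0}            # degenerate at 0.5
    fair_coin = {0.0: 0.5, 1.0: 0.5}   # same mean, more dispersed
    print(leq_icx(point_mass, fair_coin))  # same mean, less spread
    print(leq_icx(fair_coin, point_mass))  # reverse comparison
```

The two distributions in the usage lines have the same mean, so the comparison between them is really a convex-order comparison, which is one of the two ways (along with the usual stochastic order) in which $\le_{icx}$ can arise.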

2 Prior mean monotonicity 

Let us denote the maximum expected payoff of a two-armed Dirichlet bandit $(\alpha_1, \alpha_2; A_n)$ by
$W(\alpha_1, \alpha_2; A_n)$. Let $W^i(\alpha_1, \alpha_2; A_n)$ be the expected payoff when selecting arm $i$ initially and
using an optimal strategy thereafter. Then

$$W(\alpha_1, \alpha_2; A_n) = \max\{W^1(\alpha_1, \alpha_2; A_n),\ W^2(\alpha_1, \alpha_2; A_n)\}. \tag{2}$$

Suppose arm 1 is selected initially, resulting in an observation $X$. Because the prior on $P_1$ is
a Dirichlet process, the posterior is again a Dirichlet process $DP(\alpha_1 + \delta_X)$, where $\delta_x$ denotes a
point mass at $x$. Thus we have

$$W^1(\alpha_1, \alpha_2; A_n) = a_1 \mu_1 + E[W(\alpha_1 + \delta_X, \alpha_2; A_n^1) \mid \alpha_1], \tag{3}$$
$$W^2(\alpha_1, \alpha_2; A_n) = a_1 \mu_2 + E[W(\alpha_1, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2], \tag{4}$$

where $A_n^1 = (a_2, a_3, \ldots, a_n)$ and $\mu_i$ denotes the mean of $F_i = \alpha_i / M_i$, which is also the expected
value of an observation from arm $i$. In $E[g(X) \mid \alpha]$, the distribution of $X$ is $\alpha/M$ with $M = \alpha(\mathbb{R})$.
The quantities $W$, $W^1$ and $W^2$ are well defined and finite as long as $\alpha_i$, $i = 1, 2$, have finite first
moments, which we assume throughout.
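The recursion (2)-(4) can be evaluated directly by backward induction when both prior measures have finite support. The following is a minimal sketch under that assumption, not an implementation from the paper: each measure is stored as a sorted tuple of (point, weight) pairs, and the names `observe` and `value` are hypothetical. A measure with a single atom represents a known arm, since the corresponding Dirichlet process is degenerate at that atom.

```python
from functools import lru_cache


def observe(alpha, x):
    """Posterior measure alpha + delta_x, stored as sorted (point, weight) pairs."""
    weights = dict(alpha)
    weights[x] = weights.get(x, 0.0) + 1.0
    return tuple(sorted(weights.items()))


@lru_cache(maxsize=None)
def value(alpha1, alpha2, discounts):
    """Maximum expected payoff W(alpha1, alpha2; A_n) via the recursion (2)-(4).

    alpha1, alpha2: finitely supported measures as tuples of (point, weight).
    discounts: tuple (a_1, ..., a_n) of nonnegative discount factors.
    """
    if not discounts:
        return 0.0
    a1, rest = discounts[0], discounts[1:]
    payoffs = []
    for arm, alpha in enumerate((alpha1, alpha2)):
        mass = sum(w for _, w in alpha)
        total = 0.0
        for x, w in alpha:
            # The next observation equals x with probability w / mass.
            if arm == 0:
                future = value(observe(alpha1, x), alpha2, rest)
            else:
                future = value(alpha1, observe(alpha2, x), rest)
            total += (w / mass) * (a1 * x + future)
        payoffs.append(total)  # W^1 for arm == 0, W^2 for arm == 1
    return max(payoffs)


if __name__ == "__main__":
    bernoulli = ((0.0, 1.0), (1.0, 1.0))  # alpha_1 = delta_0 + delta_1
    known = ((0.5, 1.0),)                 # arm 2 always pays 0.5
    print(value(bernoulli, known, (1.0, 1.0)))
```

The number of reachable posterior states grows quickly with the horizon and the support size, so this brute-force evaluation is only practical for small examples.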

Lemma 1 reveals a convexity property of $W$ which we shall use repeatedly.

Lemma 1. Let $\alpha$ be a finite measure on $\mathbb{R}$ with a finite mean. Then, for $u, v \in \mathbb{R}$ and $r > 0$,
the function $W(\alpha + \rho\delta_u + (r - \rho)\delta_v, \alpha_2; A_n)$ is convex in $\rho \in [0, r]$.

Proof. Let us use induction on $n$. It is easy to check that the claim holds for $n = 1$. For $n \ge 2$,
we note that by (2) it suffices to show that each of $W^i(\alpha + \rho\delta_u + (r - \rho)\delta_v, \alpha_2; A_n)$, $i = 1, 2$, is
convex in $\rho \in [0, r]$. Since the mean of $\alpha + \rho\delta_u + (r - \rho)\delta_v$ is linear in $\rho$, by (3) and (4), we only
need to show that both

$$E[W(\alpha + \rho\delta_u + (r - \rho)\delta_v + \delta_X, \alpha_2; A_n^1) \mid \alpha + \rho\delta_u + (r - \rho)\delta_v] \quad \text{and} \tag{5}$$

$$E[W(\alpha + \rho\delta_u + (r - \rho)\delta_v, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2] \tag{6}$$

are convex in $\rho$. Convexity of (6) follows from the induction hypothesis. To deal with (5), we
directly compute

$$E[W(\alpha + \rho\delta_u + (r - \rho)\delta_v + \delta_X, \alpha_2; A_n^1) \mid \alpha + \rho\delta_u + (r - \rho)\delta_v]$$
$$= \frac{M}{M + r} E[W(\alpha + \rho\delta_u + (r - \rho)\delta_v + \delta_X, \alpha_2; A_n^1) \mid \alpha] \tag{7}$$
$$\quad + \frac{\rho\phi(\rho + 1) + (r - \rho)\phi(\rho)}{M + r}, \tag{8}$$

where $M = \alpha(\mathbb{R})$ and

$$\phi(\rho) = W(\alpha + \rho\delta_u + (r + 1 - \rho)\delta_v, \alpha_2; A_n^1).$$

By the induction hypothesis, $\phi(\rho)$ is convex in $\rho \in [0, r + 1]$. We claim that this implies that
$\psi(\rho) = \rho\phi(\rho + 1) + (r - \rho)\phi(\rho)$ is convex in $\rho \in [0, r]$. In fact, if $\phi(\rho)$ is twice differentiable, then
we have

$$\psi''(\rho) = 2(\phi'(\rho + 1) - \phi'(\rho)) + \rho\phi''(\rho + 1) + (r - \rho)\phi''(\rho) \ge 0, \quad \rho \in [0, r],$$

by the convexity of $\phi$. A standard limiting argument shows that $\psi(\rho)$ is convex in $\rho \in [0, r]$ as
long as $\phi(\rho)$ is convex in $\rho \in [0, r + 1]$ without assuming differentiability. Hence the second term
(8) is convex. The first term (7) is convex in $\rho \in [0, r]$ by the induction hypothesis, since in this
expectation $X$ is distributed according to $\alpha/M$ independently of $\rho$. Thus the convexity of (5) is
established. □

Theorem 1 says that the value of the bandit increases as the mean of the Dirichlet process
prior for any arm becomes stochastically larger and more dispersed. This strengthens Proposi- 
tion 2.2 of Clayton and Berry (1985) who consider the usual stochastic order rather than the 
increasing convex order. 

Theorem 1. If $M > 0$ and $F \le_{icx} \tilde{F}$, both with finite means, then

$$W(MF, \alpha_2; A_n) \le W(M\tilde{F}, \alpha_2; A_n).$$

Proof. Let us use induction. The claim obviously holds for $n = 1$. For $n \ge 2$ we have
$W^2(MF, \alpha_2; A_n) \le W^2(M\tilde{F}, \alpha_2; A_n)$ by (4) and the induction hypothesis. Moreover,

$$W^1(MF, \alpha_2; A_n) = a_1 E(X \mid F) + E[W(MF + \delta_X, \alpha_2; A_n^1) \mid F]$$
$$\le a_1 E(X \mid \tilde{F}) + E[W(M\tilde{F} + \delta_X, \alpha_2; A_n^1) \mid F]$$
$$\le a_1 E(X \mid \tilde{F}) + E[W(M\tilde{F} + \delta_X, \alpha_2; A_n^1) \mid \tilde{F}]$$
$$= W^1(M\tilde{F}, \alpha_2; A_n),$$

where the first inequality follows from $F \le_{icx} \tilde{F}$ and the induction hypothesis, noting that
$(MF + \delta_x)/(M + 1) \le_{icx} (M\tilde{F} + \delta_x)/(M + 1)$ for any $x$; the second inequality holds by the
definition of $\le_{icx}$, because $W(M\tilde{F} + \delta_x, \alpha_2; A_n^1)$ is an increasing, convex function of $x$. To show
this, fix $-\infty < u < v < \infty$. It is easy to show $(M\tilde{F} + \delta_u)/(M + 1) \le_{icx} (M\tilde{F} + \delta_v)/(M + 1)$, which,
by the induction hypothesis, implies $W(M\tilde{F} + \delta_u, \alpha_2; A_n^1) \le W(M\tilde{F} + \delta_v, \alpha_2; A_n^1)$. Moreover,

$$W(M\tilde{F} + \delta_u, \alpha_2; A_n^1) + W(M\tilde{F} + \delta_v, \alpha_2; A_n^1)$$
$$\ge 2W(M\tilde{F} + (\delta_u + \delta_v)/2, \alpha_2; A_n^1)$$
$$\ge 2W(M\tilde{F} + \delta_{(u+v)/2}, \alpha_2; A_n^1),$$

where the first inequality follows from Lemma 1 and the second inequality holds by the induction
hypothesis, noting that

$$\frac{M\tilde{F} + \delta_{(u+v)/2}}{M + 1} \le_{icx} \frac{M\tilde{F} + (\delta_u + \delta_v)/2}{M + 1}.$$

Hence $W(M\tilde{F} + \delta_x, \alpha_2; A_n^1)$ is convex in $x$ as needed. □

Remark 1. Theorem 1 extends to bandits with more than two arms. That is, the maximum
expected payoff increases when the mean of the Dirichlet process prior for any arm becomes larger
in the increasing convex order. We present the two-armed version for notational convenience.
The discount sequence in Theorem 1 is very general, i.e., we only assume $A_n$ is nonnegative. By
approximation, this can be further extended to the infinite-horizon case assuming $\sum_{i=1}^\infty a_i < \infty$.
Similar comments apply to Theorem 2 in Section 3.

When arm 2 has a known distribution $P_2$ with mean $\lambda$, the problem reduces to a one-armed
bandit. Without loss of generality we may assume the known arm yields a constant payoff
$\lambda$ at each stage, i.e., we consider the $(\alpha, \delta_\lambda; A_n)$ bandit (the subscript on $\alpha_1$ is dropped for
convenience). It is well known that, assuming the discount sequence is regular in the sense
that $(\sum_{i \ge j+1} a_i)^2 \ge (\sum_{i \ge j} a_i)(\sum_{i \ge j+2} a_i)$ for all $j \ge 1$, this one-armed bandit is an optimal
stopping problem, i.e., if at any stage it is optimal to pull arm 2 then arm 2 should be used
in all subsequent stages; see Berry and Fristedt (1979). If $A_n$ is regular, then there exists a
break-even value $\Lambda(\alpha; A_n)$ for the $(\alpha, \delta_\lambda; A_n)$ bandit, such that arm 1 is optimal initially if and
only if $\lambda \le \Lambda(\alpha; A_n)$ and arm 2 is optimal initially if and only if $\lambda \ge \Lambda(\alpha; A_n)$. For infinite-horizon
geometric discounting, this break-even value is also known as the dynamic allocation
index or Gittins index (Gittins and Jones 1974). The following result holds by the optimal
stopping characterization and is stated for uniform discounting as Lemma 2.1 in Clayton and
Berry (1985).

Lemma 2. If $A_n$ is regular, then $\Lambda(\alpha; A_n)$ is the smallest $\lambda$ such that $W(\alpha, \delta_\lambda; A_n) \le \lambda \sum_{i=1}^n a_i$.

Lemma 2 and Theorem 1 yield the following result comparing $\Lambda(\alpha; A_n)$.

Corollary 1. For $M > 0$ and $F \le_{icx} \tilde{F}$, both with finite means, we have $\Lambda(MF; A_n) \le
\Lambda(M\tilde{F}; A_n)$, assuming $A_n$ is a regular discount sequence.

Suppose $A_n$ is regular. Monotonicity and continuity considerations (see Clayton and Berry
1985) show that, for the $(\alpha, \delta_\lambda; A_n)$ bandit there exists a break-even observation $b(\alpha; A_n)$ such
that if both arms are optimal initially, and an observation $x$ is taken from arm 1, then arm 1
remains optimal if $x \ge b(\alpha; A_n)$ and arm 2 becomes optimal if $x \le b(\alpha; A_n)$. That is,

$$\Lambda(\alpha; A_n) \ge \Lambda(\alpha + \delta_x; A_n^1), \quad \text{if } x \le b(\alpha; A_n);$$
$$\Lambda(\alpha; A_n) \le \Lambda(\alpha + \delta_x; A_n^1), \quad \text{if } x \ge b(\alpha; A_n).$$

Calculating this break-even observation is nontrivial. In the case of uniform discounting, Clayton
and Berry (1985) prove an upper bound for $b(\alpha; A_n)$ and conjecture that $b(\alpha; A_n) \ge \Lambda(\alpha; A_n)$
based on numerical evidence. We confirm this in Proposition 1.

Proposition 1. Suppose $n \ge 2$ and $A_n$ is regular and all positive. Then $b(\alpha; A_n) \ge \Lambda(\alpha; A_n)$.

As noted by Berry and Fristedt (1985; p. 131), Proposition 1 has an intuitive interpretation.
Suppose both arms are optimal initially, and arm 1 is selected. If the initial pull on arm 1 yields
no more than $\Lambda(\alpha; A_n)$, which is the yield of arm 2 per pull, the hope of getting higher payoff
fades. Not surprisingly, arm 2 becomes optimal afterwards. This suggests that the break-even
observation is at least $\Lambda(\alpha; A_n)$.

To prove Proposition 1 we need a lemma.

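Lemma 3. For $c > 0$, $\lambda \in \mathbb{R}$ and an arbitrary discount sequence $A_n$, we have

$$W(\alpha + c\delta_\lambda, \delta_\lambda; A_n) \le W(\alpha, \delta_\lambda; A_n).$$

Proof. We use induction on $n$. The $n = 1$ case is easy. Suppose $n \ge 2$. Let us write $M = \alpha(\mathbb{R})$
and let $\mu$ be the mean of $\alpha/M$. Direct calculation using (2)-(4) yields

$$W(\alpha + c\delta_\lambda, \delta_\lambda; A_n) = \max\left\{ \frac{M\phi_0 + c\phi_1}{M + c},\ \phi_2 \right\}, \tag{9}$$

where

$$\phi_0 = a_1 \mu + E[W(\alpha + c\delta_\lambda + \delta_X, \delta_\lambda; A_n^1) \mid \alpha];$$

$$\phi_1 = a_1 \lambda + W(\alpha + (c + 1)\delta_\lambda, \delta_\lambda; A_n^1);$$

$$\phi_2 = a_1 \lambda + W(\alpha + c\delta_\lambda, \delta_\lambda; A_n^1).$$

Applying the induction hypothesis, and then (2) and (3), we get

$$\phi_0 \le a_1 \mu + E[W(\alpha + \delta_X, \delta_\lambda; A_n^1) \mid \alpha] \le W(\alpha, \delta_\lambda; A_n).$$

Applying the induction hypothesis, and then (2) and (4), we get

$$\phi_1 \le \phi_2 \le a_1 \lambda + W(\alpha, \delta_\lambda; A_n^1) \le W(\alpha, \delta_\lambda; A_n).$$

That is, $\phi_i \le W(\alpha, \delta_\lambda; A_n)$ for $i = 0, 1, 2$. Hence the claim holds by (9). □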

Proof of Proposition 1. Suppose $\lambda = \Lambda(\alpha; A_n)$. By the optimal stopping characterization, we
have $W(\alpha, \delta_\lambda; A_n^1) = \lambda \sum_{i=2}^n a_i$. Lemma 3 yields $W(\alpha + \delta_\lambda, \delta_\lambda; A_n^1) \le \lambda \sum_{i=2}^n a_i$. It follows from
Lemma 2 that $\lambda \ge \Lambda(\alpha + \delta_\lambda; A_n^1)$. That is, $\Lambda(\alpha; A_n) \ge \Lambda(\alpha + \delta_\lambda; A_n^1)$, which implies $\lambda \le b(\alpha; A_n)$
(under the assumption that $b(\alpha; A_n)$ is unique). □

3 Prior weight monotonicity



The main result of this section (Theorem 2) shows that the maximum expected payoff of a bandit
decreases as the prior weight for the Dirichlet process prior of an arm increases. When arm 2 is
known and the discount sequence is regular, this shows that the break-even value $\Lambda(M_1 F_1; A_n)$
decreases as $M_1$ (the prior weight associated with arm 1) increases. That is, given the same
immediate payoff, arm 1 becomes less desirable as the amount of information about it increases.

Theorem 2. Let $F$ be a probability distribution on $\mathbb{R}$ with a finite mean. If $0 < M < \tilde{M}$ then

$$W(MF, \alpha_2; A_n) \ge W(\tilde{M}F, \alpha_2; A_n). \tag{10}$$

Lemma 2 and Theorem 2 yield the following result concerning the break-even value $\Lambda(\alpha; A_n)$
for the one-armed bandit $(\alpha, \delta_\lambda; A_n)$, as conjectured by Clayton and Berry (1985) in the case of
uniform discounting.

Corollary 2. For $0 < M < \tilde{M}$ we have $\Lambda(MF; A_n) \ge \Lambda(\tilde{M}F; A_n)$, assuming $A_n$ is a regular
discount sequence.

When F has only two support points, Corollary 2 says that for a Bernoulli one-armed bandit 
with a Beta(Mu, Mv) prior, u, v > 0, for the unknown arm, the break-even value decreases in 
M. This Bernoulli case was proved by Gittins and Wang (1992) for infinite-horizon geometric 
discounting. 
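For a small numerical illustration of the Bernoulli case, the sketch below computes the break-even value by bisection, using the characterization in Lemma 2 (the smallest $\lambda$ with $W(\alpha, \delta_\lambda; A_n) \le \lambda \sum_i a_i$). It is a sketch under stated assumptions rather than the authors' procedure: the unknown arm is Bernoulli with a Beta(u, v) prior, the discount sequence is taken to be regular (uniform in the usage lines), and the function names and the tolerance are hypothetical choices.

```python
def one_armed_value(u, v, lam, discounts):
    """W(alpha, delta_lam; A_n) with alpha = u * delta_1 + v * delta_0.

    Plain recursion over the Beta(u, v) posterior; fine for short horizons.
    """
    if not discounts:
        return 0.0
    a1, rest = discounts[0], discounts[1:]
    p = u / (u + v)  # posterior mean of the unknown arm
    pull_unknown = (a1 * p
                    + p * one_armed_value(u + 1, v, lam, rest)
                    + (1 - p) * one_armed_value(u, v + 1, lam, rest))
    pull_known = a1 * lam + one_armed_value(u, v, lam, rest)
    return max(pull_unknown, pull_known)


def break_even(u, v, discounts, tol=1e-12, iters=60):
    """Smallest lam in [0, 1] with W <= lam * sum(discounts), up to tol."""
    total = sum(discounts)
    lo, hi = 0.0, 1.0  # payoffs lie in [0, 1], so the condition holds at lam = 1
    for _ in range(iters):
        mid = (lo + hi) / 2
        if one_armed_value(u, v, mid, discounts) <= mid * total + tol:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    horizon = (1.0, 1.0)  # uniform discounting, n = 2
    for weight in (1.0, 2.0, 4.0):
        # Beta(M/2, M/2): prior mean 1/2, prior weight M
        print(weight, break_even(weight / 2, weight / 2, horizon))
```

In this two-stage example the break-even values can also be found by hand (7/12, 5/9 and 8/15 for prior weights 1, 2 and 4), and they decrease as the prior weight grows while the prior mean stays at 1/2.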

The rest of this section gives a proof of Theorem 2. We assume $F$ has finite, and then
bounded, and finally arbitrary, support. The key step is summarized as Lemma 4.

Lemma 4. Assume $n \ge 2$, $L > 0$. Assume $\alpha$ is a finite measure on $\mathbb{R}$ with a finite mean
and $F$ is a probability distribution on $\mathbb{R}$ with $s < \infty$ support points. Then
$E[W(\alpha + \theta F + (L - \theta)\delta_X, \alpha_2; A_n^1) \mid F]$ decreases in $\theta \in [0, L]$.

Proof. We use induction on $s$. Although the induction may start at the trivial case $s = 1$, we
present the $s = 2$ case to illustrate the convexity arguments. Write $F = p\delta_1 + (1 - p)\delta_0$ where
$p \in (0, 1)$ and $\{0, 1\}$ are the support points without loss of generality. For fixed $0 \le \theta_1 < \theta_2 \le L$,
let $Z \sim \mathrm{Bernoulli}(p)$ and define

$$Z_i = \theta_i p + (L - \theta_i) Z, \quad i = 1, 2.$$

Then $EZ_1 = EZ_2 = pL$, and it is easy to verify $Z_2 \le_{cx} Z_1$ as $\theta_1 < \theta_2$ (see, e.g., Shaked and
Shanthikumar 2007, Theorem 3.A.18). Let us define

$$\phi(u) = W(\alpha + u\delta_1 + (L - u)\delta_0, \alpha_2; A_n^1).$$

By direct calculation

$$E[W(\alpha + \theta_1 F + (L - \theta_1)\delta_X, \alpha_2; A_n^1) \mid F] = p\phi(\theta_1 p + L - \theta_1) + (1 - p)\phi(\theta_1 p)$$
$$= E\phi(Z_1)$$
$$\ge E\phi(Z_2)$$
$$= E[W(\alpha + \theta_2 F + (L - \theta_2)\delta_X, \alpha_2; A_n^1) \mid F],$$

where the inequality holds because $Z_2 \le_{cx} Z_1$ and, by Lemma 1, $\phi(u)$ is convex in $u \in [0, L]$.
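The convex ordering of $Z_1$ and $Z_2$ used in the $s = 2$ case can also be checked by hand. The following short verification is a sketch added for illustration and is not part of the original argument; the constant $c$ is notation introduced here.

```latex
\begin{align*}
Z_i - pL &= (L - \theta_i)(Z - p), \qquad i = 1, 2, \\
Z_2 - pL &= c\,(Z_1 - pL), \qquad c = \frac{L - \theta_2}{L - \theta_1} \in [0, 1). \\
\intertext{Hence, for any convex $\phi$, convexity and then Jensen's inequality
($\phi(pL) = \phi(EZ_1) \le E\phi(Z_1)$) give}
E\phi(Z_2) &= E\phi\bigl(c\,Z_1 + (1 - c)\,pL\bigr) \\
           &\le c\,E\phi(Z_1) + (1 - c)\,\phi(pL) \\
           &\le E\phi(Z_1).
\end{align*}
```

In words, $Z_2$ is $Z_1$ shrunk toward the common mean $pL$, which is why it is the smaller of the two in the convex order.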

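For $s \ge 3$, write $F = \sum_{j=1}^s p_j \delta_{x_j}$, where $\{x_j, j = 1, \ldots, s\}$ are the support points, $p_j > 0$
and $\sum_{j=1}^s p_j = 1$. Consider the leave-one-out distributions

$$F_k = \frac{1}{1 - p_k} \sum_{j \ne k} p_j \delta_{x_j}, \quad k = 1, \ldots, s.$$

Denote $W(\gamma) = W(\gamma, \alpha_2; A_n^1)$ for convenience. For fixed $0 \le \theta_1 < \theta_2 \le L$, we have

$$(s - 1) E[W(\alpha + \theta_1 F + (L - \theta_1)\delta_X) \mid F]$$
$$= \sum_{k=1}^s (1 - p_k) E[W(\alpha + \theta_1 F + (L - \theta_1)\delta_X) \mid F_k]$$
$$= \sum_{k=1}^s (1 - p_k) E[W(\alpha + \theta_1 p_k \delta_{x_k} + \theta_1 (1 - p_k) F_k + (L - \theta_1)\delta_X) \mid F_k]$$
$$\ge \sum_{k=1}^s (1 - p_k) E[W(\alpha + \theta_1 p_k \delta_{x_k} + \theta_2 (1 - p_k) F_k + (L - \theta_2 (1 - p_k) - \theta_1 p_k)\delta_X) \mid F_k] \tag{11}$$
$$= \sum_{k=1}^s \sum_{j \ne k} p_j V_{jk},$$

where

$$V_{jk} = W(\alpha + \theta_2 \gamma_{jk} + \theta_1 p_k \delta_{x_k} + (L - \theta_2 (1 - p_k - p_j) - \theta_1 p_k)\delta_{x_j}), \quad \gamma_{jk} = \sum_{i \ne j, k} p_i \delta_{x_i}.$$

The inequality (11) follows from the induction hypothesis; other steps are algebraic manipulations.

For fixed $j \ne k$, let $Z \sim \mathrm{Bernoulli}(p_k/(p_j + p_k))$ and define

$$Z_1 = \theta_1 p_k + Z(L - \theta_2 + (\theta_2 - \theta_1)(p_j + p_k)),$$

$$Z_2 = \theta_2 p_k + Z(L - \theta_2).$$

It is easy to verify that

$$EZ_1 = EZ_2; \quad Z_2 \le_{cx} Z_1.$$

We have

$$p_j V_{jk} + p_k V_{kj} = (p_j + p_k) E W(\alpha + \theta_2 \gamma_{jk} + Z_1 \delta_{x_k} + (L - \theta_2 (1 - p_k - p_j) - Z_1)\delta_{x_j})$$
$$\ge (p_j + p_k) E W(\alpha + \theta_2 \gamma_{jk} + Z_2 \delta_{x_k} + (L - \theta_2 (1 - p_k - p_j) - Z_2)\delta_{x_j})$$
$$= p_j W(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_j}) + p_k W(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_k}),$$

where the inequality holds by Lemma 1 as $Z_2 \le_{cx} Z_1$. Hence,

$$\sum_{k=1}^s \sum_{j \ne k} p_j V_{jk} = \sum_{1 \le j < k \le s} (p_j V_{jk} + p_k V_{kj})$$
$$\ge \sum_{1 \le j < k \le s} [p_j W(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_j}) + p_k W(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_k})]$$
$$= (s - 1) \sum_{j=1}^s p_j W(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_j})$$
$$= (s - 1) E[W(\alpha + \theta_2 F + (L - \theta_2)\delta_X) \mid F].$$

Thus we have shown that $E[W(\alpha + \theta F + (L - \theta)\delta_X) \mid F]$ decreases in $\theta \in [0, L]$. □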

Proof of Theorem 2. (i) Assume $F$ has finite support. The claim obviously holds for $n = 1$. For
$n \ge 2$ we use induction. In view of (2)-(4), we only need to show

$$E[W(MF + \delta_X, \alpha_2; A_n^1) \mid F] \ge E[W(\tilde{M}F + \delta_X, \alpha_2; A_n^1) \mid F] \quad \text{and} \tag{12}$$
$$E[W(MF, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2] \ge E[W(\tilde{M}F, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2]. \tag{13}$$

By the induction hypothesis, (13) holds. Define $\eta = (\tilde{M} + 1)/(M + 1)$ and $\theta = \tilde{M}/\eta$. Noting
$M < \theta < M + 1$, we may apply Lemma 4 and get

$$E[W(MF + \delta_X, \alpha_2; A_n^1) \mid F] \ge E[W(\theta F + (M + 1 - \theta)\delta_X, \alpha_2; A_n^1) \mid F]$$
$$\ge E[W(\eta(\theta F + (M + 1 - \theta)\delta_X), \alpha_2; A_n^1) \mid F] \tag{14}$$
$$= E[W(\tilde{M}F + \delta_X, \alpha_2; A_n^1) \mid F],$$

where (14) holds by the induction hypothesis, as $\eta > 1$. Thus (12) holds as required.

(ii) Assume $F$ has bounded support. Then for arbitrary $\varepsilon > 0$ we can construct two distributions
$F^*$ and $F_*$ supported on $\{x_1, \ldots, x_s\}$ and $\{x_0, \ldots, x_{s-1}\}$ respectively, where $x_j = x_0 + j\varepsilon$,
such that $F(x_0) = 0$, $F(x_s) = 1$ and $F^*(x_j) = F_*(x_{j-1}) = F(x_j)$, $j = 1, \ldots, s$. By construction,
$F_* \le_{st} F \le_{st} F^*$. Theorem 1 yields

$$W(MF_*, \alpha_2; A_n) \le W(MF, \alpha_2; A_n) \le W(MF^*, \alpha_2; A_n).$$

Note that if $X \sim F^*$ then $X - \varepsilon \sim F_*$. Therefore the bandits $(MF^*, \alpha_2; A_n)$ and $(MF_*, \alpha_2; A_n)$
can be coupled in an obvious way such that, for every strategy of $(MF^*, \alpha_2; A_n)$, there exists a
strategy of $(MF_*, \alpha_2; A_n)$ under which the payoff at each stage is either the same (when arm 2
is selected), or exactly $\varepsilon$ less (when arm 1 is selected). Thus we have shown

$$W(MF^*, \alpha_2; A_n) - W(MF_*, \alpha_2; A_n) \le \varepsilon \sum_{i=1}^n a_i.$$

Hence $W(MF^*, \alpha_2; A_n) \to W(MF, \alpha_2; A_n)$ as $\varepsilon \to 0$, and the monotonicity of $W(MF^*, \alpha_2; A_n)$
with respect to $M$ implies the corresponding monotonicity of $W(MF, \alpha_2; A_n)$.

(iii) Finally, assume $F$ is an arbitrary distribution with a finite mean. Suppose $X \sim F$. For
$L > 0$ let $F^*$ be the distribution of $X^*$, defined as $X$ if $|X| \le L$ and $0$ otherwise. We construct
a coupling between $(MF, \alpha_2; A_n)$ and $(MF^*, \alpha_2; A_n)$. Let $X_k$ be the resulting observation when
arm 1 of $(MF, \alpha_2; A_n)$ is pulled for the $k$th time. If $|X_1| \le L$ then let $X_1^* = X_1$, otherwise
$X_1^* = 0$, yielding $X_1^* \sim F^*$. For general $k \ge 1$, if $|X_i| \le L$, $i = 1, \ldots, k$, then let $X_{k+1}^* = X_{k+1}$
if $|X_{k+1}| \le L$ and $0$ otherwise. In this case the conditional distribution of $X_{k+1}$
given $X_i$, $i = 1, \ldots, k$, is $(MF + \sum_{i=1}^k \delta_{X_i})/(M + k)$. Since $|X_i| \le L$, $i = 1, \ldots, k$, we have
$X_i^* = X_i$, $i = 1, \ldots, k$, and the conditional distribution of $X_{k+1}^*$ given $X_i^*$, $i = 1, \ldots, k$, is
precisely $(MF^* + \sum_{i=1}^k \delta_{X_i^*})/(M + k)$. That is, $X_i^*$, $i = 1, \ldots, k + 1$, can be regarded as
successive pulls from arm 1 of $(MF^*, \alpha_2; A_n)$ as long as $|X_i| \le L$, $i = 1, \ldots, k$. Let the $k$th pull
from arm 2 be $Y_k$ for both bandits. In the event that all $|X_i| \le L$, $i = 1, \ldots, n$, the optimal
strategy for $(MF, \alpha_2; A_n)$ can be adopted for $(MF^*, \alpha_2; A_n)$ throughout, yielding identical pulls
(not all $X_i$, $i = 1, \ldots, n$, are realized). By considering a trivial upper (respectively, lower) bound
for the payoff of $(MF, \alpha_2; A_n)$ (respectively, $(MF^*, \alpha_2; A_n)$) when at least one $|X_i| > L$, we have

$$W(MF, \alpha_2; A_n) - W(MF^*, \alpha_2; A_n) \le E\left[ 1_{\cup_{i=1}^n \{|X_i| > L\}} \sum_{i=1}^n \bigl( a_i(|Y_i| + |X_i|) - a_i(-|Y_i| - L) \bigr) \right]$$
$$\le E\left[ 1_{\cup_{i=1}^n \{|X_i| > L\}} \sum_{i=1}^n a^* (2|Y_i| + |X_i| + L) \right]$$
$$\le E\left[ \left( \sum_{i=1}^n 1_{\{|X_i| > L\}} \right) \sum_{i=1}^n a^* (2|Y_i| + |X_i| + L) \right]$$
$$= a^* h(L),$$

where $a^* = \max_{i=1}^n a_i$. Direct calculation using exchangeability yields

$$h(L) = n^2 \Pr(|X_1| > L)(2E|Y_1| + L) + nE[1_{\{|X_1| > L\}} |X_1|] + n(n - 1)E[1_{\{|X_1| > L\}} |X_2|].$$

The first two terms tend to zero as $L \to \infty$ by dominated convergence since $E|X_1| < \infty$. For
the last term, by conditioning on $X_1$ we have

$$E[1_{\{|X_1| > L\}} |X_2|] = E\left[ 1_{\{|X_1| > L\}} \left( \frac{M}{M + 1} E|X| + \frac{1}{M + 1} |X_1| \right) \right],$$

which also vanishes as $L \to \infty$. Thus

$$\limsup_{L \to \infty} [W(MF, \alpha_2; A_n) - W(MF^*, \alpha_2; A_n)] \le 0.$$

By a parallel argument, we get $\liminf_{L \to \infty} [W(MF, \alpha_2; A_n) - W(MF^*, \alpha_2; A_n)] \ge 0$. Thus
$W(MF^*, \alpha_2; A_n)$ tends to $W(MF, \alpha_2; A_n)$ as $L \to \infty$, and the monotonicity of $W(MF, \alpha_2; A_n)$
with respect to $M$ is proved as before. □

Remark 2. Clayton and Berry (1985) also conjecture that the monotonicity in Corollary 2
is strict if $n \ge 2$, $A_n = (1, 1, \ldots, 1)$, and $F$ is nondegenerate. This can be confirmed by a careful
analysis of the above results. Some modifications are needed. Using arguments similar to steps
(ii) and (iii) in the proof of Theorem 2, we can first establish that Lemma 4 holds without the
finite support restriction. Directly applying this strengthened Lemma 4 shows that (10) holds
with strict inequality assuming $n \ge 2$, $A_n = (1, 1, \ldots, 1)$, $F$ is nondegenerate, and arm 1 is
optimal initially in $(MF, \alpha_2; A_n)$. Under such conditions, the strictness of the inequality holds
by induction as one key step (14) holds with strict inequality. It follows that Corollary 2 can be
strengthened to strict monotonicity assuming uniform discounting, $n \ge 2$, and a nondegenerate
$F$.